Introduction

Our Question

text

Original Study

Below is a visual that shows the type of study we originally wanted to conduct. The data behind it are not open source, and because we could not find comparable data elsewhere, we changed both our data and the remainder of the project.

The visual gives a rough answer to our original questions: “When is the cheapest time to buy airline tickets? Does pricing change significantly as demand for flights varies? Do airlines vary price to combat strategic consumers?” In short, yes: ticket prices are extremely elastic.

Cheapair.com

Data Source: Source


Literature Review

  • Higher cost airlines are inclined to determine which market consumers belong to, tourism or business.

  • Last minute deals are more than offset by the increased prices leading up to them.

  • Price discrimination while difficult for airlines to execute still occurs.

In summary of the first two points, studies find that, in order to maximize profits, airlines should charge lower fares to tourists buying tickets one to two periods in advance, charge higher fares to business travelers in the period of the flight, and cut costs on the day of the flight.

We find evidence of price discrimination by airlines in other studies, some of which show that fares are higher for weekday flights than for weekend flights in an attempt to price discriminate against people going on business trips.


Data Analysis

All data were obtained from the US Federal Aviation Database systems. Source

Because of the size of the data, this document is generated with code that randomly samples 20,000 points from our 11 million. As a result, the charts below vary slightly between runs, so we refrained from interpreting them in detail. The tabular outputs, however, represent the full data set. For the regression outputs, the interpretations reference the estimated values in code, so the reported numbers update with each sample while the wording of the interpretation stays fixed. The patterns and relationships observed remain constant across samples because of the strength and significance of our variables and the large sample size.
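The sampling step described above can be sketched as follows. This is a minimal illustration rather than the report's actual code: the toy data frame `full_dat`, its size, and the seed are assumptions standing in for the real data set of roughly 11 million rows. Fixing a seed is one possible way to keep the charts identical between runs.

```r
# Toy stand-in for the full data set (hypothetical; the real one has ~11 million rows)
set.seed(42)  # assumed seed, used only so the draw is reproducible
full_dat <- data.frame(
  ITIN_YIELD = runif(100000, min = 0.05, max = 2),
  PASSENGERS = sample(1:20, 100000, replace = TRUE)
)

# Draw 20,000 rows without replacement, as described in the text
samp <- full_dat[sample(nrow(full_dat), size = 20000), ]

nrow(samp)
```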

Variable Definition

| Variable Name | Description |
|---|---|
| QUARTER | Quarter (1-4) |
| ROUNDTRIP | Round Trip Indicator (1 = Yes) |
| ITIN_YIELD | Itinerary Fare Per Mile Flown in Dollars (ITIN_FARE / MILES_FLOWN) |
| PASSENGERS | Number of Passengers |
| ITIN_FARE | Itinerary Fare Per Person |
| DISTANCE_GROUP | Distance Group, in 500 Mile Intervals |
| MILES_FLOWN | Itinerary Miles Flown (Track Miles) |
| ITIN_GEO_TYPE | Itinerary Geography Type, 0 = Contiguous Domestic (Lower 48 U.S. States Only), 1 = Non-contiguous Domestic (Includes Hawaii, Alaska and Territories) |
tabl3 <-"
| Transformed Variable Name | Original Variable Name | Description  | 
|--------------------|:---------------:|:--------------------------:|
| lPASSENGERS        | PASSENGERS    |  Log(PASSENGERS)   |
| SQRT_1over_DG_x_MF | DISTANCE_GROUP & MILES_FLOWN |   $\\sqrt{\\frac{1}{\\text{DISTANCE_GROUP} * \\text{MILES_FLOWN}}}$  |         
"


tabl3 %>% pander()
| Transformed Variable Name | Original Variable Name | Description |
|---|---|---|
| lPASSENGERS | PASSENGERS | Log(PASSENGERS) |
| SQRT_1over_DG_x_MF | DISTANCE_GROUP & MILES_FLOWN | \(\sqrt{\frac{1}{\text{DISTANCE_GROUP} * \text{MILES_FLOWN}}}\) |
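As a rough sketch of how the two transformed variables in the table could be built, the snippet below applies both formulas to a few made-up rows. The data frame `dat` and its values are hypothetical, and the natural log is assumed because that is the default of R's `log()`.

```r
# Hypothetical rows; column names follow the variable definitions above
dat <- data.frame(
  PASSENGERS     = c(1, 2, 10),
  DISTANCE_GROUP = c(1, 3, 5),
  MILES_FLOWN    = c(400, 1300, 2400)
)

# Natural log of passengers (log() is base e by default)
dat$lPASSENGERS <- log(dat$PASSENGERS)

# sqrt(1 / (DISTANCE_GROUP * MILES_FLOWN))
dat$SQRT_1over_DG_x_MF <- sqrt(1 / (dat$DISTANCE_GROUP * dat$MILES_FLOWN))

dat
```

Because the second variable shrinks as distance grows, a positive coefficient on it corresponds to yields that fall with distance.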

Graphical Summaries

Overview

Yield by log(Passengers)

ggplot(samp %>% drop_na()) +
  geom_smooth(aes(x= lPASSENGERS, y=ITIN_YIELD, col=ROUNDTRIP)) +
  #geom_jitter(aes(x= lPASSENGERS, y=ITIN_YIELD, col=ROUNDTRIP), alpha= 0.05) +
  facet_grid(rows= ~ITIN_GEO_TYPE)+
  theme_bw()+
  labs(col = "Flight Type", title= "Yield by log(Passengers)")+ 
  xlab("Log(Passengers)") + ylab("Fare per mile per passenger (Dollars)")

Yield by Distance Groups

ggplot(samp %>% drop_na()) +
  #geom_jitter(aes(x= DISTANCE_GROUP, y=ITIN_YIELD, col=ROUNDTRIP), alpha= 0.0075) +
  geom_smooth(aes(x= DISTANCE_GROUP, y=ITIN_YIELD, col=ROUNDTRIP)) +
  facet_grid(rows= ~ITIN_GEO_TYPE)+
  theme_bw()+
  theme(
      panel.spacing = unit(0.5, "lines")
    )+ 
  labs(col = "Flight Type", title= "Yield by Distance")+ 
  xlab("Distance in intervals of 500") + ylab("Fare per mile per passenger (Dollars)")

Yield by Binaries

ggplot(data = samp) +
    geom_histogram(aes(x=ITIN_YIELD, fill= ROUNDTRIP)) +
    theme_bw() +
    gghighlight(use_direct_label = FALSE) +
    facet_wrap(~ITIN_GEO_TYPE) +
    theme(
      panel.spacing = unit(0.5, "lines"),
      axis.ticks.x=element_blank()
    )+ 
  labs(fill = "Flight Type", title= "Distribution of Yields by Flight Types")+ 
  xlab("Fare per mile per passenger (Dollars)") + ylab("Count") 

Tabular Summaries

Overview

Yield ~ RoundTrip & Geo

pander(favstats(ITIN_YIELD ~  ROUNDTRIP + ITIN_GEO_TYPE, data=FullDat_Filt)[c("ROUNDTRIP.ITIN_GEO_TYPE", "Q1","median", "mean","Q3", "sd","n")], caption= "Summary table of Yields by Flight Type and Geography")
Summary table of Yields by Flight Type and Geography
| ROUNDTRIP.ITIN_GEO_TYPE | Q1 | median | mean | Q3 | sd | n |
|---|---|---|---|---|---|---|
| One-Way.Continguous Domestic | 0.1047 | 0.1738 | 0.2354 | 0.2908 | 0.2091 | 4025439 |
| RoundTrip.Continguous Domestic | 0.1015 | 0.1599 | 0.2063 | 0.2571 | 0.1657 | 5803943 |
| One-Way.Non-Continguous Domestic | 0.0709 | 0.1014 | 0.1423 | 0.1586 | 0.148 | 338790 |
| RoundTrip.Non-Continguous Domestic | 0.0681 | 0.0942 | 0.1246 | 0.1337 | 0.1322 | 438735 |

Yield ~ Distance Group

pander(favstats(ITIN_YIELD ~ DISTANCE_GROUP, data=FullDat_Filt)[c(1:5, 12:16, 23:25),c("DISTANCE_GROUP", "Q1","median", "mean","Q3", "sd","n")], caption= "Summary table of Yields by Distance Group")
Summary table of Yields by Distance Group
  DISTANCE_GROUP Q1 median mean Q3 sd n
1 1 0.3347 0.5259 0.6113 0.8075 0.3752 420249
2 2 0.1809 0.2847 0.3348 0.4358 0.2166 1675504
3 3 0.1301 0.2045 0.2357 0.3043 0.1472 1941759
4 4 0.1071 0.1659 0.1874 0.2405 0.1128 1759498
5 5 0.0901 0.136 0.1561 0.1984 0.09458 1631197
12 12 0.0601 0.0841 0.09356 0.1158 0.048 74074
13 13 0.0595 0.0819 0.08936 0.1104 0.04411 30455
14 14 0.0613 0.0829 0.09038 0.1134 0.04188 26418
15 15 0.0572 0.079 0.08533 0.1071 0.0393 14473
16 16 0.0611 0.0798 0.08534 0.1036 0.03466 22320
23 23 0.0552 0.06725 0.07131 0.0857 0.02555 258
24 24 0.0568 0.0681 0.07164 0.08505 0.02991 131
25 25 0.0631 0.0869 0.07687 0.1041 0.03513 377

Data Conclusions

  • The variability of yield changes with passengers and distance (heteroskedasticity), which will cause issues with our standard errors.
  • Increasing distance or passengers leads to decreased pricing per mile and thus lower yields.
  • The distribution of yields varies somewhat with flight geography and flight type.

Methodology

Two regressions were created in our attempts to better understand the data and the relationships between our variables. The first uses, at most, simple transformations such as logs to help reduce heteroskedasticity. The second employs a more involved transformation to linearize a variable that did not initially have a simple linear relationship with our endogenous variable.

Variable Overview

Our Variables

Variables

Full Pairs charts

# Upper-panel function for pairs(): prints |correlation| plus significance stars
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste(prefix, txt, sep = "")
  # Fall back to a size based on the label width when cex.cor is not supplied
  cex <- if (missing(cex.cor)) 0.8 / strwidth(txt) else cex.cor
  test <- cor.test(x, y)
  # borrowed from printCoefmat
  Signif <- symnum(test$p.value, corr = FALSE, na = FALSE,
                   cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                   symbols = c("***", "**", "*", ".", " "))
  text(0.5, 0.5, txt, cex = 1.5)
  text(.7, .8, Signif, cex = cex, col = 2)
}

pairs(samp, lower.panel=panel.smooth, upper.panel=panel.cor)

Standard Regression

Initial Regression Model

\[ \underbrace{Y_i}_\text{Itinerary Yield} \underbrace{=}_{\sim} \overbrace{\beta_0}^{\stackrel{\text{y-int}}{\text{Base Yield}}} + \overbrace{\beta_1}^{\stackrel{\text{slope along}}{\text{lPassenger}}} \underbrace{X_{1i}}_\text{lPassenger} + \overbrace{\beta_2}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{2i}}_\text{Distance Group} + \overbrace{\beta_3}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{3i}}_\text{Roundtrip} + \overbrace{\beta_4}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{4i}}_\text{Non-Continguous} +\overbrace{\beta_5}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{lPassenger:Distance Group} + \epsilon_i \]

Results

lm1 <- lm(ITIN_YIELD ~ lPASSENGERS + DISTANCE_GROUP + ROUNDTRIP + ITIN_GEO_TYPE + lPASSENGERS:DISTANCE_GROUP , data= samp)

summary(lm1) %>% pander
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3559 0.002609 136.4 0
lPASSENGERS -0.03338 0.003671 -9.092 1.063e-19
DISTANCE_GROUP -0.03484 0.0005114 -68.12 0
ROUNDTRIPRoundTrip 0.05373 0.002601 20.66 8.305e-94
ITIN_GEO_TYPENon-Continguous Domestic 0.08365 0.005044 16.58 2.355e-61
lPASSENGERS:DISTANCE_GROUP -0.002986 0.0008062 -3.704 0.0002126
Fitting linear model: ITIN_YIELD ~ lPASSENGERS + DISTANCE_GROUP + ROUNDTRIP + ITIN_GEO_TYPE + lPASSENGERS:DISTANCE_GROUP
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
20000 0.1625 0.2265 0.2263
lm1_r2 <- round(summary(lm1)$adj.r.squared, 2)
lm1_RSE <- round(sigma(lm1)*100, 1)
matrix_coef <- summary(lm1)$coefficients
my_estimates <- matrix_coef[ , 1] 
b0 <- round(my_estimates[1]*100, 2)
b1 <- round(my_estimates[2]*100, 2)
b2 <- round(my_estimates[3]*100, 2)
b3 <- round(my_estimates[4]*100, 2)
b4 <- round(my_estimates[5]*100, 2)
b5 <- round(my_estimates[6]*100, 2)  # in cents like b0-b4; unscaled, -0.002986 rounds to 0

matrix_coef %>% pander(caption= "Results")
Results
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3559 0.002609 136.4 0
lPASSENGERS -0.03338 0.003671 -9.092 1.063e-19
DISTANCE_GROUP -0.03484 0.0005114 -68.12 0
ROUNDTRIPRoundTrip 0.05373 0.002601 20.66 8.305e-94
ITIN_GEO_TYPENon-Continguous Domestic 0.08365 0.005044 16.58 2.355e-61
lPASSENGERS:DISTANCE_GROUP -0.002986 0.0008062 -3.704 0.0002126

Our initial regression model, fit by ordinary least squares, has an \(R^2\) of 0.23, which in the scope of our data is fairly substantial. Airline pricing is incredibly varied and involves hundreds of possible factors; we have access to only a few of them and can therefore account for only a limited share of the total variation. Our residual standard error, however, is less than ideal in context: an error of 16.3 cents in yield is a sizable share of our total yield range ($0.05-$2), about 8% of that range (0.163 / 1.95).

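Skipping the y-intercept, as its interpretation would make little realistic sense in this case, the specific coefficient interpretations are as follows:

  • For every one-unit increase in log(passengers), that is, when the passenger count is multiplied by roughly 2.72, we see a 3.34 cent decline in yield. A 1% increase in itinerary passengers therefore corresponds to a decline of roughly 0.03 cents. This assumes natural logs, R's default for log(), and ignores the interaction term.

  • For every 500 additional miles on an itinerary we see a 3.48 cent decline in yield.

  • Roundtrip flights on average provide an additional 5.37 cent yield.

  • Non-contiguous domestic flights on average yield 8.36 cents more per mile.

  • The interaction term is negative: for each additional distance group, the decline in yield per one-unit increase in log(passengers) steepens by a further 0.30 cents.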
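Because the model includes an interaction, the effect of passengers depends on the distance group. The sketch below works this through with the coefficients from the regression table above; the helper names `slope_lpass` and `effect_1pct_cents` are hypothetical, and the usual level-log reading is assumed, in which a 1% increase in passengers raises log(passengers) by about 0.01.

```r
# Coefficients copied from the regression table above (dollars per mile)
b_lpass <- -0.03338    # lPASSENGERS
b_inter <- -0.002986   # lPASSENGERS:DISTANCE_GROUP

# Slope of yield with respect to log(passengers) at a given distance group
slope_lpass <- function(distance_group) b_lpass + b_inter * distance_group

# Approximate change in yield, in cents, for a 1% increase in passengers
effect_1pct_cents <- function(distance_group) {
  slope_lpass(distance_group) * 0.01 * 100
}

effect_1pct_cents(c(1, 5, 10))
```

The effect becomes more negative as the distance group rises, which is what the negative interaction coefficient implies.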

Assumptions

As the data is not a time series, we limited our testing to heteroskedasticity and multicollinearity.

Below are the results from a Breusch-Pagan test:

bptest(lm1)
## 
##  studentized Breusch-Pagan test
## 
## data:  lm1
## BP = 647.06, df = 5, p-value < 2.2e-16

Despite the transformation applied to passengers, significant heteroskedasticity is still present. This is likely due to the error variance changing as X increases, as well as misspecification from omitting significant variables.
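For readers unfamiliar with the test, the studentized Breusch-Pagan statistic can be reproduced by hand: regress the squared residuals on the regressors and multiply the resulting R-squared by the sample size. The sketch below uses simulated data, not the flight data; the variable names and seed are assumptions.

```r
set.seed(1)  # assumed seed for a reproducible toy example
n <- 500
x <- runif(n, 1, 10)
# Error spread grows with x, so the data are heteroskedastic by construction
y <- 1 + 0.5 * x + rnorm(n, sd = 0.3 * x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the original regressors
aux <- lm(resid(fit)^2 ~ x)

# Studentized (Koenker) statistic: n times the auxiliary R-squared
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = p_value)
```

A small p-value rejects constant error variance, which is the same reading applied to the `bptest()` output above.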

Due to concerns about high correlation between our variables we tested for Multi-collinearity as well:

vif(lm1)
##                lPASSENGERS             DISTANCE_GROUP 
##                   3.909970                   1.663725 
##                  ROUNDTRIP              ITIN_GEO_TYPE 
##                   1.245640                   1.288492 
## lPASSENGERS:DISTANCE_GROUP 
##                   3.848542

As none of our values are greater than 10 we should not be worried about multi-collinearity.
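The variance inflation factor for a regressor is 1 / (1 - R²), where R² comes from regressing that regressor on all the others. A minimal base-R sketch on simulated data is given below; the helper `vif_by_hand` and the toy variables are hypothetical and only illustrate the calculation.

```r
set.seed(2)  # assumed seed
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # deliberately correlated with x1
x3 <- rnorm(n)                       # unrelated to the others

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing x_j on the other regressors
vif_by_hand <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vifs <- c(
  x1 = vif_by_hand(x1, data.frame(x2, x3)),
  x2 = vif_by_hand(x2, data.frame(x1, x3)),
  x3 = vif_by_hand(x3, data.frame(x1, x2))
)
vifs
```

The correlated pair ends up with noticeably larger values than the independent regressor, while all remain far below the threshold of 10 used in the text.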

Robust Least Squares Model

Because heteroskedasticity means OLS is no longer BLUE and its usual standard errors are unreliable, we recalculated the standard errors using a heteroskedasticity-robust (HC1) estimator. As shown below, the skeleton of the model remains the same; the coefficients are still estimated by OLS, and only the method used to calculate their standard errors changes.

\[ \underbrace{Y_i}_\text{Itinerary Yield} \underbrace{=}_{\sim} \overbrace{\beta_0}^{\stackrel{\text{y-int}}{\text{Base Yield}}} + \overbrace{\beta_1}^{\stackrel{\text{slope along}}{\text{lPassenger}}} \underbrace{X_{1i}}_\text{lPassenger} + \overbrace{\beta_2}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{2i}}_\text{Distance Group} + \overbrace{\beta_3}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{3i}}_\text{Roundtrip} + \overbrace{\beta_4}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{4i}}_\text{Non-Continguous} +\overbrace{\beta_5}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{lPassenger:Distance Group} + \epsilon_i \]

Results

As shown in the output below, the coefficient estimates, and therefore the relationships of our exogenous variables to our endogenous variable yield, remain the same; the standard errors, and with them the t values and p-values, have shifted somewhat.

Robust Standard errors:

coeftest(lm1, vcov = vcovHC(lm1, type= 'HC1'))
## 
## t test of coefficients:
## 
##                                          Estimate  Std. Error  t value
## (Intercept)                            0.35592279  0.00361278  98.5177
## lPASSENGERS                           -0.03337920  0.00463428  -7.2027
## DISTANCE_GROUP                        -0.03483639  0.00065336 -53.3188
## ROUNDTRIPRoundTrip                     0.05373388  0.00279555  19.2212
## ITIN_GEO_TYPENon-Continguous Domestic  0.08364814  0.00531294  15.7442
## lPASSENGERS:DISTANCE_GROUP            -0.00298634  0.00093041  -3.2097
##                                        Pr(>|t|)    
## (Intercept)                           < 2.2e-16 ***
## lPASSENGERS                           6.114e-13 ***
## DISTANCE_GROUP                        < 2.2e-16 ***
## ROUNDTRIPRoundTrip                    < 2.2e-16 ***
## ITIN_GEO_TYPENon-Continguous Domestic < 2.2e-16 ***
## lPASSENGERS:DISTANCE_GROUP             0.001331 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
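The `vcovHC(..., type = 'HC1')` call above uses a heteroskedasticity-consistent covariance estimator. A rough base-R sketch of that calculation is shown below on simulated data (the data and seed are assumptions); it computes the sandwich formula with the n / (n - k) scaling that defines HC1.

```r
set.seed(3)  # assumed seed
n <- 400
x <- runif(n, 1, 10)
y <- 2 - 0.1 * x + rnorm(n, sd = 0.2 * x)  # heteroskedastic toy data

fit <- lm(y ~ x)
X   <- model.matrix(fit)
e   <- resid(fit)
k   <- ncol(X)

# Sandwich estimator: (X'X)^-1 [X' diag(e^2) X] (X'X)^-1, scaled by n / (n - k)
bread  <- solve(crossprod(X))
meat   <- crossprod(X * e)   # equals X' diag(e^2) X
vc_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

robust_se <- sqrt(diag(vc_hc1))
ols_se    <- sqrt(diag(vcov(fit)))

# Same point estimates, different standard errors
cbind(estimate = coef(fit), ols_se, robust_se)
```

Only the standard-error column changes between the two approaches; the estimates are the OLS estimates in both cases.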

In addition to the simple robust estimates, given how extreme our Breusch-Pagan results were, we felt it would also be useful to calculate 95% confidence intervals for our estimators and be doubly sure that they remained interpretable and useful. As shown, none of the intervals contains zero, so all estimates retain their signs and are safe to include and use in a model.

Robust Coefficients at 95% confidence:

coefci(lm1, vcov = vcovHC(lm1, type= 'HC1'))
##                                              2.5 %       97.5 %
## (Intercept)                            0.348841448  0.363004136
## lPASSENGERS                           -0.042462776 -0.024295627
## DISTANCE_GROUP                        -0.036117028 -0.033555749
## ROUNDTRIPRoundTrip                     0.048254375  0.059213388
## ITIN_GEO_TYPENon-Continguous Domestic  0.073234335  0.094061939
## lPASSENGERS:DISTANCE_GROUP            -0.004810024 -0.001162649
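Each interval above is the estimate plus or minus a t critical value times the robust standard error. As a quick check, the snippet below rebuilds the lPASSENGERS interval from the `coeftest()` output; the residual degrees of freedom are taken as 20,000 observations minus 6 coefficients.

```r
# Values copied from the coeftest() output for lPASSENGERS
estimate  <- -0.03337920
robust_se <-  0.00463428
df_resid  <- 20000 - 6   # observations minus estimated coefficients

t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * robust_se
ci
```

The result agrees with the reported bounds of -0.042462776 and -0.024295627 up to rounding of the inputs.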

Transformed Regression

In this transformed model the nonlinear relationship between distance group, miles flown and yield was transformed into a simple linear relationship (refer to the variable overview). The implications of this are expanded upon below.

Initial Regression Model

\[ \underbrace{Y_i}_\text{Itinerary Yield} \underbrace{=}_{\sim} \overbrace{\beta_0}^{\stackrel{\text{y-int}}{\text{Base Yield}}} + \overbrace{\beta_1}^{\stackrel{\text{slope along}}{\text{lPassenger}}} \underbrace{X_{1i}}_\text{lPassenger} + \overbrace{\beta_2}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{2i}}_\text{SQRT_1over_DG_x_MF} + \overbrace{\beta_3}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{3i}}_\text{Roundtrip} + \overbrace{\beta_4}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{4i}}_\text{Non-Continguous} +\overbrace{\beta_5}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{lPassenger:SQRT_1over_DG_x_MF} + \epsilon_i \]

To best maintain the ability to compare the two regressions, all variables were kept the same except for the replacement of DISTANCE_GROUP with the new transformed variable.

Results

lm2 <- lm(ITIN_YIELD ~ lPASSENGERS + SQRT_1over_DG_x_MF + ROUNDTRIP + ITIN_GEO_TYPE + lPASSENGERS:SQRT_1over_DG_x_MF, data= samp)
summary(lm2) %>% pander
Table continues below
  Estimate Std. Error t value
(Intercept) -0.0002572 0.00261 -0.09854
lPASSENGERS -0.02186 0.002658 -8.224
SQRT_1over_DG_x_MF 12.88 0.1116 115.4
ROUNDTRIPRoundTrip 0.068 0.002126 31.98
ITIN_GEO_TYPENon-Continguous Domestic 0.006291 0.003839 1.639
lPASSENGERS:SQRT_1over_DG_x_MF -1.704 0.12 -14.2
  Pr(>|t|)
(Intercept) 0.9215
lPASSENGERS 2.095e-16
SQRT_1over_DG_x_MF 0
ROUNDTRIPRoundTrip 6.71e-219
ITIN_GEO_TYPENon-Continguous Domestic 0.1013
lPASSENGERS:SQRT_1over_DG_x_MF 1.475e-45
Fitting linear model: ITIN_YIELD ~ lPASSENGERS + SQRT_1over_DG_x_MF + ROUNDTRIP + ITIN_GEO_TYPE + lPASSENGERS:SQRT_1over_DG_x_MF
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
20000 0.1375 0.4461 0.446
lm2_r2 <- round(summary(lm2)$adj.r.squared, 2)
lm2_RSE <- round(sigma(lm2)*100, 1)  # use lm2 here, not lm1
matrix_coef <- summary(lm2)$coefficients
my_estimates <- matrix_coef[ , 1] 
b0 <- round(my_estimates[1]*100, 2)
b1 <- round(my_estimates[2]*100, 2)
b2 <- round(my_estimates[3], 2)
b3 <- round(my_estimates[4]*100, 2)
b4 <- round(my_estimates[5]*100, 2)
b5 <- round(my_estimates[6], 2)

Our transformed regression model, fit by ordinary least squares, has an \(R^2\) of 0.45, which in the scope of our data is fairly substantial. Airline pricing is incredibly varied and involves hundreds of possible factors; we have access to only a few of them and can therefore account for only part of the total variation. Our residual standard error is still less than ideal in context: an error of about 13.8 cents in yield (0.1375 dollars) is a sizable share of our total yield range ($0.05-$2), about 7% of that range. The primary issue with this model is that we lose the ability to easily interpret a change in distance due to the complexity of the transformation.

Skipping the y-intercept, as its interpretation would make little realistic sense in this case, the specific coefficient interpretations are as follows:

  • For every one-unit increase in log(passengers), that is, when the passenger count is multiplied by roughly 2.72, we see a 2.19 cent decline in yield. A 1% increase in itinerary passengers therefore corresponds to a decline of roughly 0.02 cents. This assumes natural logs, R's default for log(), and ignores the interaction term.

  • For every 1 unit increase in \(\text{(Miles Flown * Distance group)}^{-\frac{1}{2}}\) on an itinerary we see a 12.88 dollar increase in yield.

  • Roundtrip flights on average provide an additional 6.8 cent yield.

  • Non-contiguous domestic flights on average yield 0.63 cents more per mile, but this difference is no longer significant at the 5% level under the standard OLS errors (p = 0.10).

  • The interaction term is negative: each one-unit increase in log(passengers) lowers the slope on \(\text{(Miles Flown * Distance group)}^{-\frac{1}{2}}\) by 1.7 dollars.
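Although a one-unit change in the transformed variable is hard to picture, the coefficients can still be turned into a statement about distance by comparing two concrete itineraries. The sketch below does this with the estimates from the table above; the helper `distance_part` and the two example itineraries (including which distance group each mileage falls in) are hypothetical.

```r
# Coefficients copied from the transformed regression table above
b_sqrt  <- 12.88    # SQRT_1over_DG_x_MF
b_inter <- -1.704   # lPASSENGERS:SQRT_1over_DG_x_MF

# Contribution of the distance terms to predicted yield (dollars per mile)
distance_part <- function(distance_group, miles_flown, passengers = 1) {
  s <- sqrt(1 / (distance_group * miles_flown))
  (b_sqrt + b_inter * log(passengers)) * s
}

# Hypothetical single-passenger itineraries: 750 miles (group 2) vs 1,250 miles (group 3)
short_trip <- distance_part(distance_group = 2, miles_flown = 750)
long_trip  <- distance_part(distance_group = 3, miles_flown = 1250)

c(short = short_trip, long = long_trip, difference = long_trip - short_trip)
```

The negative difference shows the longer itinerary contributing less to predicted yield, consistent with the direction found in the first model.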

Assumptions

As the data is not a time series we limited our testing to only Heteroskedasticity and multi-collinearity.

bptest(lm2)
## 
##  studentized Breusch-Pagan test
## 
## data:  lm2
## BP = 2858.2, df = 5, p-value < 2.2e-16

Despite the transformation applied to passengers and the attempt to linearize distance, significant heteroskedasticity is still present, in this case even more so than before. This is likely due to the error variance changing as X increases, as well as misspecification from omitting significant variables.

Due to concerns about high correlation between our variables we tested for Multi-collinearity as well:

vif(lm2)
##                    lPASSENGERS             SQRT_1over_DG_x_MF 
##                       2.863075                       1.530031 
##                      ROUNDTRIP                  ITIN_GEO_TYPE 
##                       1.162086                       1.042313 
## lPASSENGERS:SQRT_1over_DG_x_MF 
##                       3.347589

As none of our values are greater than 10 we should not be worried about multi-collinearity.

Robust Least Squares Model

Results

Again, due to the issues found in our assumptions, we calculated robust standard errors to use in place of the traditional OLS standard errors.

Robust Standard errors:

As shown in the output below, the coefficient estimates, and therefore the relationships of our exogenous variables to our endogenous variable yield, remain the same; the standard errors, and with them the t values and p-values, have shifted somewhat.

coeftest(lm2, vcov = vcovHC(lm2, type= 'HC1'))
## 
## t test of coefficients:
## 
##                                          Estimate  Std. Error t value  Pr(>|t|)
## (Intercept)                           -0.00025718  0.00384553 -0.0669   0.94668
## lPASSENGERS                           -0.02186187  0.00317125 -6.8938 5.595e-12
## SQRT_1over_DG_x_MF                    12.87795718  0.23807526 54.0920 < 2.2e-16
## ROUNDTRIPRoundTrip                     0.06799543  0.00231664 29.3509 < 2.2e-16
## ITIN_GEO_TYPENon-Continguous Domestic  0.00629125  0.00318185  1.9772   0.04803
## lPASSENGERS:SQRT_1over_DG_x_MF        -1.70420573  0.21591821 -7.8928 3.105e-15
##                                          
## (Intercept)                              
## lPASSENGERS                           ***
## SQRT_1over_DG_x_MF                    ***
## ROUNDTRIPRoundTrip                    ***
## ITIN_GEO_TYPENon-Continguous Domestic *  
## lPASSENGERS:SQRT_1over_DG_x_MF        ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Robust Coefficients at 95% confidence:

Again, in addition to the simple robust estimates, given how extreme our Breusch-Pagan results were, we felt it would also be useful to calculate 95% confidence intervals for our estimators and be doubly sure that they remained interpretable and useful. As shown, all estimates retain the same signs and are thus safe to include and use in a model, with the exception of our intercept, whose interval spans zero; the interval for geography type only narrowly excludes zero.

coefci(lm2, vcov = vcovHC(lm2, type= 'HC1'))
##                                               2.5 %       97.5 %
## (Intercept)                           -7.794734e-03  0.007280367
## lPASSENGERS                           -2.807777e-02 -0.015645967
## SQRT_1over_DG_x_MF                     1.241131e+01 13.344604358
## ROUNDTRIPRoundTrip                     6.345462e-02  0.072536232
## ITIN_GEO_TYPENon-Continguous Domestic  5.456693e-05  0.012527936
## lPASSENGERS:SQRT_1over_DG_x_MF        -2.127423e+00 -1.280988205

Regression plot

#Graph Resolution (more important for more complex shapes)
graph_reso <- 0.025

#Setup Axis
axis_x <- seq(min(samp$DISTANCE_GROUP), max(samp$DISTANCE_GROUP), by = graph_reso)
axis_y <- seq(min(samp$lPASSENGERS), max(samp$lPASSENGERS), by = graph_reso)
axis_col <- as.factor(c("One-Way", "RoundTrip"))
axis_f <- as.factor(c("Continguous Domestic", "Non-Continguous Domestic"))

#Sample points
lmnew <- expand.grid(DISTANCE_GROUP = axis_x, lPASSENGERS = axis_y, ROUNDTRIP = axis_col, ITIN_GEO_TYPE = axis_f ,  KEEP.OUT.ATTRS=F)
lmnew$Z <- predict.lm(lm1, newdata = lmnew)
lmnew <- acast(lmnew, lPASSENGERS ~ DISTANCE_GROUP , value.var = "Z") #y ~ x
samp %>% 
  filter(ITIN_GEO_TYPE == "Continguous Domestic") %>%
  plot_ly(., 
               x = ~DISTANCE_GROUP, 
               y = ~lPASSENGERS, 
               z = ~ITIN_YIELD, 
               #text = rownames(samp %>% drop_na()),
               type = "scatter3d",
               mode ="markers",
               color = ~as.factor(ROUNDTRIP),
               alpha= 0.7) %>%
              layout(title= list(text = "Continguous Domestic Flights (Lower 48)"))
samp %>% 
  filter(ITIN_GEO_TYPE == "Non-Continguous Domestic") %>%
  plot_ly(., 
               x = ~DISTANCE_GROUP, 
               y = ~lPASSENGERS, 
               z = ~ITIN_YIELD, 
               #text = rownames(samp %>% drop_na()),
               type = "scatter3d",
               mode ="markers",
               color = ~as.factor(ROUNDTRIP),
               alpha= 0.7) %>%
              layout(title= list(text = "Non-Continguous Domestic Flights (Outside Lower 48)"))
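A fitted surface could be laid over scatter plots like these by predicting on a grid and reshaping the predictions into a matrix with one row per y value and one column per x value. The sketch below shows only that reshaping step in base R, on simulated data with a simplified two-variable model; the data, the model, and the coarser grid are assumptions, not the report's `lm1`.

```r
set.seed(4)  # assumed seed
# Toy data with the same axis variables as the plots above
toy <- data.frame(
  DISTANCE_GROUP = rep(1:25, times = 12),
  lPASSENGERS    = log(sample(1:20, 300, replace = TRUE))
)
toy$ITIN_YIELD <- 0.9 - 0.03 * toy$DISTANCE_GROUP -
  0.03 * toy$lPASSENGERS + rnorm(300, sd = 0.05)

fit <- lm(ITIN_YIELD ~ lPASSENGERS * DISTANCE_GROUP, data = toy)

# Grid over both axes (coarser than the 0.025 step used above)
axis_x <- seq(min(toy$DISTANCE_GROUP), max(toy$DISTANCE_GROUP), by = 1)
axis_y <- seq(min(toy$lPASSENGERS), max(toy$lPASSENGERS), length.out = 20)
grid <- expand.grid(DISTANCE_GROUP = axis_x, lPASSENGERS = axis_y,
                    KEEP.OUT.ATTRS = FALSE)
grid$Z <- predict(fit, newdata = grid)

# expand.grid varies DISTANCE_GROUP fastest, so fill by row: rows = y, cols = x
surface <- matrix(grid$Z, nrow = length(axis_y), ncol = length(axis_x),
                  byrow = TRUE)
dim(surface)
```

With factor regressors in the model, as in `lm1`, the grid would need a single level of each factor per surface so that every cell of the matrix holds exactly one prediction.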

Conclusions and Avenues for Future Research

In rough outline:

Increasing the number of passengers on an itinerary leads to decreasing yields, likely through the discounts we assume come with bulk purchasing.

Increasing distance also reduces yields as the flight lasts longer. This is likely related to fixed costs as a percentage of total costs.

Lastly, when comparing one-way and round trips, one-way flights yield more than round trips in the summary tables, and this gap is larger for contiguous flights than for non-contiguous flights.


Literature Cited